This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Tue Dec 24 15:51:46 2024.
Data Description:
This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
## Import required packages
packages <- c("tidyverse", "lubridate", "timetk", "forecast", "prophet", "plotly", "ggcorrplot")
installed_packages <- installed.packages()
for (pkg in packages) {
if (!(pkg %in% installed_packages)) {
install.packages(pkg)
}
}
lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Loading required package: Rcpp
##
## Loading required package: rlang
##
##
## Attaching package: 'rlang'
##
##
## The following objects are masked from 'package:purrr':
##
## %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
## flatten_raw, invoke, splice
##
##
##
## Attaching package: 'plotly'
##
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
##
## The following object is masked from 'package:stats':
##
## filter
##
##
## The following object is masked from 'package:graphics':
##
## layout
## [[1]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [13] "grDevices" "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "timetk" "lubridate" "forcats" "stringr" "dplyr" "purrr"
## [7] "readr" "tidyr" "tibble" "ggplot2" "tidyverse" "stats"
## [13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "forecast" "timetk" "lubridate" "forcats" "stringr" "dplyr"
## [7] "purrr" "readr" "tidyr" "tibble" "ggplot2" "tidyverse"
## [13] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
## [19] "base"
##
## [[5]]
## [1] "prophet" "rlang" "Rcpp" "forecast" "timetk" "lubridate"
## [7] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr"
## [13] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [19] "utils" "datasets" "methods" "base"
##
## [[6]]
## [1] "plotly" "prophet" "rlang" "Rcpp" "forecast" "timetk"
## [7] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
## [13] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
## [19] "grDevices" "utils" "datasets" "methods" "base"
##
## [[7]]
## [1] "ggcorrplot" "plotly" "prophet" "rlang" "Rcpp"
## [6] "forecast" "timetk" "lubridate" "forcats" "stringr"
## [11] "dplyr" "purrr" "readr" "tidyr" "tibble"
## [16] "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [21] "utils" "datasets" "methods" "base"
# Load the Dataset
# Set the path to the files and read the CSV data
day_data <- read.csv("day.csv")
hour_data <- read.csv("hour.csv")
# Initial Glimpse of the Data
glimpse(day_data)
## Rows: 731
## Columns: 16
## $ instant <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday <chr> "2011-01-01", "2011-01-02", "2011-01-03", "2011-01-04", "20…
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ weekday <int> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
## $ workingday <int> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ weathersit <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
## $ temp <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
## $ atemp <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
## $ hum <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
## $ windspeed <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
## $ casual <int> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
## $ registered <int> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
## $ cnt <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…
glimpse(hour_data)
## Rows: 17,379
## Columns: 17
## $ instant <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday <chr> "2011-01-01", "2011-01-01", "2011-01-01", "2011-01-01", "20…
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ hr <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ holiday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weekday <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ workingday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ weathersit <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,…
## $ temp <dbl> 0.24, 0.22, 0.22, 0.24, 0.24, 0.24, 0.22, 0.20, 0.24, 0.32,…
## $ atemp <dbl> 0.2879, 0.2727, 0.2727, 0.2879, 0.2879, 0.2576, 0.2727, 0.2…
## $ hum <dbl> 0.81, 0.80, 0.80, 0.75, 0.75, 0.75, 0.80, 0.86, 0.75, 0.76,…
## $ windspeed <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0896, 0.0000, 0.0…
## $ casual <int> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41, 1…
## $ registered <int> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70, 5…
## $ cnt <int> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110…
# Set up a working directory (optional but recommended)
# setwd("bike_sharing_dataset/")
# Initial Exploration of Daily Data
cat("\n--- Summary of day_data ---\n")
##
## --- Summary of day_data ---
summary(day_data)
## instant dteday season yr
## Min. : 1.0 Length:731 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 Class :character 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Mode :character Median :3.000 Median :1.0000
## Mean :366.0 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
cat("\n--- Structure of day_data ---\n")
##
## --- Structure of day_data ---
str(day_data)
## 'data.frame': 731 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
1. cnt (total rentals): - Range: 22 to 8,714 rentals per day. - Median: 4,548 rentals, showing a relatively even distribution around the middle value.
2. season and yr: - Data spans all four seasons and two years (2011 and 2012).
3. Weather-related variables: - temp (normalized temperature): Ranges from 0.06 to 0.86, showing diverse weather conditions. - windspeed: Ranges from 0.02 to 0.50, indicating varying wind conditions.
4. Insights from Structure: - dteday is stored as a character type, which was converted to Date for time series analysis. - cnt, casual, and registered provide key demand data, with cnt being the sum of the latter two.
# Check for missing values in day_data
cat("\n--- Checking for Missing Values in day_data ---\n")
##
## --- Checking for Missing Values in day_data ---
colSums(is.na(day_data))
## instant dteday season yr mnth holiday weekday
## 0 0 0 0 0 0 0
## workingday weathersit temp atemp hum windspeed casual
## 0 0 0 0 0 0 0
## registered cnt
## 0 0
# Explore the Hourly Data
cat("\n--- Summary of hour_data ---\n")
##
## --- Summary of hour_data ---
summary(hour_data)
## instant dteday season yr
## Min. : 1 Length:17379 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 Class :character 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 Mode :character Median :3.000 Median :1.0000
## Mean : 8690 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 Max. :4.000 Max. :1.0000
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
cat("\n--- Structure of hour_data ---\n")
##
## --- Structure of hour_data ---
str(hour_data)
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
1. - Hourly data contains 17,379 records, offering detailed granularity. - cnt (total rentals) ranges from 1 to 977 rentals per hour. - hr: Represents each hour of the day, allowing for time-of-day analysis.
2. Insights from Structure: - hr provides an opportunity to explore hourly trends (e.g., peak times for rentals). - dteday in this dataset also required conversion to Date format.
# Check for missing values in hour_data
cat("\n--- Checking for Missing Values in hour_data ---\n")
##
## --- Checking for Missing Values in hour_data ---
colSums(is.na(hour_data))
## instant dteday season yr mnth hr holiday
## 0 0 0 0 0 0 0
## weekday workingday weathersit temp atemp hum windspeed
## 0 0 0 0 0 0 0
## casual registered cnt
## 0 0 0
# Check Unique Values for Key Categorical Columns in Daily Data
cat("\n--- Unique values in categorical variables (day_data) ---\n")
##
## --- Unique values in categorical variables (day_data) ---
cat("Seasons: ", unique(day_data$season), "\n")
## Seasons: 1 2 3 4
cat("Years: ", unique(day_data$yr), "\n")
## Years: 0 1
cat("Months: ", unique(day_data$mnth), "\n")
## Months: 1 2 3 4 5 6 7 8 9 10 11 12
cat("Weathersit: ", unique(day_data$weathersit), "\n")
## Weathersit: 2 1 3
# Visualize Distribution of Total Bike Rentals (cnt)
library(ggplot2)
ggplot(day_data, aes(x = cnt)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
labs(
title = "Distribution of Total Bike Rentals (cnt)",
x = "Total Bike Rentals",
y = "Frequency"
)
The distribution is roughly bell-shaped, with most daily rental counts falling between 2,500 and 7,500.
- There are a few outliers on the lower end, suggesting days with unusually low rentals (possibly due to extreme weather or holidays).
# Convert 'dteday' column to Date type
day_data$dteday <- as.Date(day_data$dteday)
# Analyze Seasonal and Temporal Trends in Daily Data
ggplot(day_data, aes(x = dteday, y = cnt)) +
geom_line(color = "blue") +
labs(
title = "Daily Bike Rentals Over Time",
x = "Date",
y = "Total Bike Rentals"
)
1. Upward Trend: A general increase in rentals from early 2011 to mid-2012, indicating growing demand for bike rentals.
2. Seasonality: - Peaks in warmer months (likely spring and summer) and dips in colder months (fall and winter).
3. Spikes: Sudden jumps may be due to events or anomalies, such as promotional campaigns or extreme weather.
# Check Correlation of Continuous Variables in Daily Data
numeric_cols <- sapply(day_data, is.numeric)
cor_matrix <- cor(day_data[, numeric_cols])
cat("\n--- Correlation Matrix (Daily Data) ---\n")
##
## --- Correlation Matrix (Daily Data) ---
print(cor_matrix)
## instant season yr mnth holiday
## instant 1.000000e+00 0.412224179 0.866025404 0.496701889 0.016144632
## season 4.122242e-01 1.000000000 -0.001844343 0.831440114 -0.010536659
## yr 8.660254e-01 -0.001844343 1.000000000 -0.001792434 0.007954311
## mnth 4.967019e-01 0.831440114 -0.001792434 1.000000000 0.019190895
## holiday 1.614463e-02 -0.010536659 0.007954311 0.019190895 1.000000000
## weekday -1.617914e-05 -0.003079881 -0.005460765 0.009509313 -0.101960269
## workingday -4.336537e-03 0.012484963 -0.002012621 -0.005900951 -0.253022700
## weathersit -2.147721e-02 0.019211028 -0.048726541 0.043528098 -0.034626841
## temp 1.505803e-01 0.334314856 0.047603572 0.220205335 -0.028555535
## atemp 1.526382e-01 0.342875613 0.046106149 0.227458630 -0.032506692
## hum 1.637471e-02 0.205444765 -0.110651045 0.222203691 -0.015937479
## windspeed -1.126196e-01 -0.229046337 -0.011817060 -0.207501752 0.006291507
## casual 2.752552e-01 0.210399165 0.248545664 0.123005889 0.054274203
## registered 6.596229e-01 0.411623051 0.594248168 0.293487830 -0.108744863
## cnt 6.288303e-01 0.406100371 0.566709708 0.279977112 -0.068347716
## weekday workingday weathersit temp atemp
## instant -1.617914e-05 -0.004336537 -0.02147721 0.1505803019 0.152638238
## season -3.079881e-03 0.012484963 0.01921103 0.3343148564 0.342875613
## yr -5.460765e-03 -0.002012621 -0.04872654 0.0476035719 0.046106149
## mnth 9.509313e-03 -0.005900951 0.04352810 0.2202053352 0.227458630
## holiday -1.019603e-01 -0.253022700 -0.03462684 -0.0285555350 -0.032506692
## weekday 1.000000e+00 0.035789674 0.03108747 -0.0001699624 -0.007537132
## workingday 3.578967e-02 1.000000000 0.06120043 0.0526598102 0.052182275
## weathersit 3.108747e-02 0.061200430 1.00000000 -0.1206022365 -0.121583354
## temp -1.699624e-04 0.052659810 -0.12060224 1.0000000000 0.991701553
## atemp -7.537132e-03 0.052182275 -0.12158335 0.9917015532 1.000000000
## hum -5.223210e-02 0.024327046 0.59104460 0.1269629390 0.139988060
## windspeed 1.428212e-02 -0.018796487 0.03951106 -0.1579441204 -0.183642967
## casual 5.992264e-02 -0.518044191 -0.24735300 0.5432846617 0.543863690
## registered 5.736744e-02 0.303907117 -0.26038771 0.5400119662 0.544191758
## cnt 6.744341e-02 0.061156063 -0.29739124 0.6274940090 0.631065700
## hum windspeed casual registered cnt
## instant 0.01637471 -0.112619556 0.27525521 0.65962287 0.62883027
## season 0.20544476 -0.229046337 0.21039916 0.41162305 0.40610037
## yr -0.11065104 -0.011817060 0.24854566 0.59424817 0.56670971
## mnth 0.22220369 -0.207501752 0.12300589 0.29348783 0.27997711
## holiday -0.01593748 0.006291507 0.05427420 -0.10874486 -0.06834772
## weekday -0.05223210 0.014282124 0.05992264 0.05736744 0.06744341
## workingday 0.02432705 -0.018796487 -0.51804419 0.30390712 0.06115606
## weathersit 0.59104460 0.039511059 -0.24735300 -0.26038771 -0.29739124
## temp 0.12696294 -0.157944120 0.54328466 0.54001197 0.62749401
## atemp 0.13998806 -0.183642967 0.54386369 0.54419176 0.63106570
## hum 1.00000000 -0.248489099 -0.07700788 -0.09108860 -0.10065856
## windspeed -0.24848910 1.000000000 -0.16761335 -0.21744898 -0.23454500
## casual -0.07700788 -0.167613349 1.00000000 0.39528245 0.67280443
## registered -0.09108860 -0.217448981 0.39528245 1.00000000 0.94551692
## cnt -0.10065856 -0.234544997 0.67280443 0.94551692 1.00000000
1. Strong Correlations: - cnt and registered: 0.95, indicating that most rentals are by registered users. - cnt and temp: 0.63, suggesting more rentals occur in favorable temperatures.
2. Weak/Negative Correlations: - cnt and windspeed: -0.23, as high wind speeds may deter biking. - cnt and holiday: -0.07, indicating fewer rentals on holidays.
# Visualize Correlation Matrix
library(ggcorrplot)
ggcorrplot(cor_matrix, lab = TRUE)
# Ensure 'dteday' is in Date format
day_data$dteday <- as.Date(day_data$dteday)
# Load necessary libraries
library(timetk)
library(plotly)
# Create an interactive time series plot for daily bike rentals
interactive_plot <- day_data %>%
select(dteday, cnt) %>%
plot_time_series(
.date_var = dteday,
.value = cnt,
.interactive = TRUE, # Enables interactivity
.title = "Interactive Time Series Plot: Daily Bike Rentals",
.x_lab = "Date",
.y_lab = "Total Rentals"
)
# Print the interactive plot
interactive_plot
1. Interactive visualization provides a clearer view of trends and patterns: - Seasonal peaks and troughs are evident. - Allows zooming into specific time periods for further analysis.
# Ensure 'dteday' is in Date format
day_data$dteday <- as.Date(day_data$dteday)
# Smooth the Time Series using Moving Average
library(dplyr)
day_data <- day_data %>%
mutate(
ma_7 = zoo::rollmean(cnt, k = 7, fill = NA, align = "right"), # 7-day moving average
ma_30 = zoo::rollmean(cnt, k = 30, fill = NA, align = "right") # 30-day moving average
)
# Plot Smoothed Data
library(ggplot2)
ggplot(day_data, aes(x = dteday)) +
geom_line(aes(y = cnt, color = "Original"), size = 0.5) +
geom_line(aes(y = ma_7, color = "7-Day Moving Average"), linewidth = 1) +
geom_line(aes(y = ma_30, color = "30-Day Moving Average"), linewidth = 1) +
labs(
title = "Smoothed Time Series: Daily Bike Rentals",
x = "Date",
y = "Total Rentals",
color = "Legend"
) +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 29 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Use Lowess Smoothing
lowess_fit <- lowess(day_data$dteday, day_data$cnt, f = 0.2) # Adjust 'f' for smoothness
# Add Lowess to the Data Frame
day_data$lowess <- lowess_fit$y
# Plot Lowess Smoothed Data
ggplot(day_data, aes(x = dteday)) +
geom_line(aes(y = cnt, color = "Original"), linewidth = 0.5) +
geom_line(aes(y = lowess, color = "Lowess Smoothing"), linewidth = 1) +
labs(
title = "Lowess Smoothed Time Series: Daily Bike Rentals",
x = "Date",
y = "Total Rentals",
color = "Legend"
) +
theme_minimal()
Description: 1. The plot shows the original daily bike rental data (blue line) alongside 7-day (green line) and 30-day (red line) moving averages.
Insights: 1. 7-Day Moving Average captures weekly fluctuations and short-term trends.
30-Day Moving Average smooths out the data further, providing a clearer view of long-term trends.
Seasonal trends are visible, with peaks during warm months and troughs during colder months.
Description: 1. The plot overlays a Lowess (locally weighted smoothing) line (red) on the original data (blue).
Insights: 1. The Lowess curve provides a flexible, non-linear fit, highlighting the overall pattern without assuming linear relationships. 2. It confirms the seasonality and long-term growth trends, as well as the dip during winter months.
# Convert the data into a time series object
library(stats)
bike_ts <- ts(day_data$cnt, start = c(2011, 1), frequency = 365) # Daily data
# Decompose the time series
decomposed_ts <- decompose(bike_ts, type = "multiplicative")
# Plot the decomposition (no `main` argument)
plot(decomposed_ts)
# title(main = "Decomposition of Daily Bike Rentals")
Description: 1. The decomposition splits the time series into observed, trend, seasonal, and random (residual) components.
Insights: 1. Trend: Shows a gradual increase over time, indicating growing bike rental demand. 2. Seasonality: Repeats annually, with higher demand in warmer months and lower in colder ones. 3. Residuals: Fluctuations that cannot be explained by trend or seasonality, likely due to events or anomalies.
# Load necessary library
library(tseries)
# Perform Augmented Dickey-Fuller Test
adf_test <- adf.test(bike_ts, alternative = "stationary")
# Output the results of the ADF Test
cat("\n--- Augmented Dickey-Fuller Test Results ---\n")
##
## --- Augmented Dickey-Fuller Test Results ---
print(adf_test)
##
## Augmented Dickey-Fuller Test
##
## data: bike_ts
## Dickey-Fuller = -1.6351, Lag order = 9, p-value = 0.7327
## alternative hypothesis: stationary
# Interpret the results
if (adf_test$p.value < 0.05) {
cat("The time series is stationary (p-value < 0.05).\n")
} else {
cat("The time series is NOT stationary (p-value >= 0.05). Consider differencing or transformations.\n")
}
## The time series is NOT stationary (p-value >= 0.05). Consider differencing or transformations.
# Visualize the differenced data if not stationary
if (adf_test$p.value >= 0.05) {
diff_bike_ts <- diff(bike_ts)
# Re-check stationarity after differencing
adf_diff_test <- adf.test(diff_bike_ts, alternative = "stationary")
cat("\n--- Augmented Dickey-Fuller Test Results After Differencing ---\n")
print(adf_diff_test)
# Plot the differenced time series
plot(diff_bike_ts, main = "Differenced Time Series", ylab = "Differenced Rentals", xlab = "Time")
}
## Warning in adf.test(diff_bike_ts, alternative = "stationary"): p-value smaller
## than printed p-value
##
## --- Augmented Dickey-Fuller Test Results After Differencing ---
##
## Augmented Dickey-Fuller Test
##
## data: diff_bike_ts
## Dickey-Fuller = -13.798, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
Description: 1. The plot shows the first-differenced time series to remove trends and seasonality, making the data stationary.
Insights: 1. After differencing, the time series oscillates around zero, indicating that trends have been removed. 2. This is crucial for applying ARIMA models, which require stationarity.
# Load the necessary library
library(forecast)
# Step 1: Fit an ARIMA Model
# Apply the auto.arima() function to find the best ARIMA model for the differenced time series
arima_model <- auto.arima(bike_ts, seasonal = TRUE, stepwise = TRUE, approximation = FALSE)
# Output the summary of the fitted model
cat("\n--- Summary of ARIMA Model ---\n")
##
## --- Summary of ARIMA Model ---
summary(arima_model)
## Series: bike_ts
## ARIMA(2,0,1)(0,1,0)[365] with drift
##
## Coefficients:
## ar1 ar2 ma1 drift
## 1.2100 -0.2379 -0.8801 5.7039
## s.e. 0.0716 0.0615 0.0465 0.7707
##
## sigma^2 = 1597015: log likelihood = -3131.46
## AIC=6272.92 AICc=6273.09 BIC=6292.43
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 5.293444 889.3036 456.5587 -44.23962 51.68411 0.1965677
## ACF1
## Training set 0.006346344
# Step 2: Check Residual Diagnostics
# Plot diagnostics to check residuals of the ARIMA model
cat("\n--- Residual Diagnostics for ARIMA Model ---\n")
##
## --- Residual Diagnostics for ARIMA Model ---
checkresiduals(arima_model)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,1)(0,1,0)[365] with drift
## Q* = 369.58, df = 143, p-value < 2.2e-16
##
## Model df: 3. Total lags used: 146
Description: 1. The diagnostics evaluate the quality of the ARIMA model by examining residuals. 2. Top-left: Residuals over time. 3. Top-right: Density plot of residuals. 4. Bottom-left: ACF (autocorrelation function) of residuals.
Insights: 1. The residuals appear randomly distributed around zero, indicating no significant patterns left unexplained by the model. 2. The ACF plot shows no significant autocorrelations, suggesting the residuals are white noise and the ARIMA model fits well.
# Step 3: Forecast Future Bike Rentals
# Forecast the next 30 days
forecast_arima <- forecast(arima_model, h = 30)
# Plot the forecast
cat("\n--- Forecast Plot ---\n")
##
## --- Forecast Plot ---
plot(forecast_arima, main = "Forecast of Daily Bike Rentals", xlab = "Time", ylab = "Total Rentals")
Description: 1. The plot shows the forecasted values (blue line) for the next 30 days, along with confidence intervals (shaded areas).
Insights: 1. The forecast follows the observed seasonal pattern, with expected demand increasing during warmer months. 2. Wider confidence intervals reflect uncertainty in the predictions, particularly as the forecast horizon extends.
# Step 4: Print Forecasted Values
cat("\n--- Forecasted Values for the Next 30 Days ---\n")
##
## --- Forecasted Values for the Next 30 Days ---
print(forecast_arima)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2013.0027 2693.869 1074.3328 4313.404 217.002392 5170.735
## 2013.0055 3089.424 1384.0409 4794.807 481.265674 5697.582
## 2013.0082 3281.965 1556.6908 5007.238 643.386079 5920.543
## 2013.0110 4232.921 2497.3261 5968.516 1578.557683 6887.284
## 2013.0137 5101.337 3357.7697 6844.905 2434.780695 7767.894
## 2013.0164 5564.492 3813.9083 7315.075 2887.205492 8241.778
## 2013.0192 4506.988 2750.0050 6263.972 1819.914184 7194.063
## 2013.0219 3495.018 1732.1393 5257.896 798.927705 6191.108
## 2013.0247 4752.666 2984.3428 6520.989 2048.248937 7457.083
## 2013.0274 3365.992 1592.6357 5139.348 653.877526 6078.107
## 2013.0301 5319.047 3541.0360 7097.058 2599.813987 8038.280
## 2013.0329 4467.878 2685.5617 6250.194 1742.060616 7193.695
## 2013.0356 3777.530 1991.2311 5563.829 1045.621532 6509.439
## 2013.0384 3625.048 1835.0632 5415.033 887.502503 6362.593
## 2013.0411 3640.473 1847.0772 5433.869 897.710729 6383.235
## 2013.0438 4304.846 2508.2925 6101.399 1557.254643 7052.437
## 2013.0466 4772.205 2972.7290 6571.682 2020.143728 7524.267
## 2013.0493 4713.589 2911.4063 6515.772 1957.388354 7469.790
## 2013.0521 4609.033 2804.3443 6413.722 1848.999634 7369.067
## 2013.0548 2770.572 963.5626 4577.582 6.989297 5534.156
## 2013.0575 3469.240 1660.0808 5278.400 702.369575 6236.111
## 2013.0603 3946.069 2134.9183 5757.220 1176.153104 6715.985
## 2013.0630 5874.090 4061.0945 7687.085 3101.352837 8646.826
## 2013.0658 5825.332 4010.6281 7640.036 3049.981964 8600.682
## 2013.0685 5649.825 3833.5381 7466.112 2872.053843 8427.596
## 2013.0712 5049.597 3231.8427 6867.350 2269.581943 7829.611
## 2013.0740 5634.673 3815.5601 7453.786 2852.579787 8416.767
## 2013.0767 4872.081 3051.7080 6692.453 2088.060905 7656.100
## 2013.0795 5269.844 3448.3038 7091.384 2484.038794 8055.649
## 2013.0822 6170.986 4348.3645 7993.608 3383.526883 8958.445
Description: 1. A tabular output of the forecasted daily bike rentals, including 80% and 95% confidence intervals.
Insights: 1. The point forecast provides the predicted daily rentals. 2. The confidence intervals indicate the range within which the true value is likely to fall: - 80% interval is narrower and less conservative. - 95% interval is wider and more conservative.
Based on the extensive analysis of the bike rental data from 2011 to 2012, several key insights have emerged:
This analysis underscores the importance of understanding seasonal patterns, user behavior, and environmental factors in optimizing bike rental operations. By leveraging the insights and implementing the recommendations above, the company can enhance its service, improve user satisfaction, and achieve operational efficiency while maximizing profitability.